A Turkish-German Code-Switching Corpus
نویسنده
چکیده
Bilingual communities often alternate between languages both in spoken and written communication. One such community, Germany residents of Turkish origin produce Turkish-German code-switching, by heavily mixing two languages at discourse, sentence, or word level. Code-switching in general, and Turkish-German code-switching in particular, has been studied for a long time from a linguistic perspective. Yet resources to study them from a more computational perspective are limited due to either small size or licence issues. In this work we contribute the solution of this problem with a corpus. We present a Turkish-German code-switching corpus which consists of 1029 tweets, with a majority of intra-sentential switches. We share different type of code-switching we have observed in our collection and describe our processing steps. The first step is data collection and filtering. This is followed by manual tokenisation and normalisation. And finally, we annotate data with word-level language identification information. The resulting corpus is available for
منابع مشابه
A Code-Switching Corpus of Turkish-German Conversations
We present a code-switching corpus of Turkish-German that is collected by recording conversations of bilinguals. The recordings are then transcribed in two layers following speech and orthography conventions, and annotated with sentence boundaries and intersentential, intrasentential, and intra-word switch points. The total amount of data is 5 hours of speech which corresponds to 3614 sentences...
متن کاملPart of Speech Annotation of a Turkish-German Code-Switching Corpus
In this paper we describe our efforts on POS annotation of a code-switching corpus created from Turkish-German tweets. We use Universal Dependencies (UD) POS tags as our tag set. While the German parts of the corpus employ UD specifications, for the Turkish parts we propose annotation guidelines that adopt UD’s language-general rules when it is applicable and adapt its principles to Turkishspec...
متن کاملDetecting Code-Switching in a Multilingual Alpine Heritage Corpus
This paper describes experiments in detecting and annotating code-switching in a large multilingual diachronic corpus of Swiss Alpine texts. The texts are in English, French, German, Italian, Romansh and Swiss German. Because of the multilingual authors (mountaineers, scientists) and the assumed multilingual readers, the texts contain numerous code-switching elements. When building and annotati...
متن کاملMultilingual Semantic Parsing And Code-Switching
Extending semantic parsing systems to new domains and languages is a highly expensive, time-consuming process, so making effective use of existing resources is critical. In this paper, we describe a transfer learning method using crosslingual word embeddings in a sequence-tosequence model. On the NLmaps corpus, our approach achieves state-of-the-art accuracy of 85.7% for English. Most important...
متن کاملA Mandarin-English Code-Switching Corpus
Generally the existing monolingual corpora are not suitable for large vocabulary continuous speech recognition (LVCSR) of codeswitching speech. The motivation of this paper is to study the rules and constraints code-switching follows and design a corpus for code-switching LVCSR task. This paper presents the development of a Mandarin-English code-switching corpus. This corpus consists of four pa...
متن کامل